Poster: Arkiver (Administrator, Curator, or Staff) Date: Jan 7, 2014 8:16am
Forum: faqs Subject: Re: the captures of my site are so sparse

I think it is quite possible that you lost a lot of traffic to your website during the second half of 2012 because of the robots.txt file with its 30-second crawl delay. As has been mentioned before, a crawl delay as large as 30 seconds is bad for search engine crawlers and can result in your website slowly disappearing from Google, Yahoo and other search engines... That can cost you a lot of traffic.

In one of your older posts you said hackers edited the robots.txt file and added exclusions for all kinds of search engine bots. Adding a delay of 30 seconds is another effective way to keep a website out of search engine results.
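To make the effect of such a delay concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content below is a hypothetical stand-in, not the actual file from the site, and the pages-per-day figure is only an upper bound for a single crawler that honors the directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption: the real file is not shown in this thread).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds a compliant crawler is asked to wait between requests.
delay = parser.crawl_delay("*")

# Upper bound on requests per day for one crawler that honors the delay.
pages_per_day = 24 * 60 * 60 // delay
print(delay, pages_per_day)
```

Not every crawler supports `Crawl-delay`, so this only describes crawlers that choose to honor it.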


I'm now running a quick link discovery program on your website to find all the URLs, and I'll then add them to the Wayback Machine.


Poster: Medworks Date: Jan 7, 2014 11:43am
Forum: faqs Subject: Re: the captures of my site are so sparse

Yeah, if you look at the robots.txt file as it appeared at the beginning of 2011, you'll see all the search engine crawlers that the hacker excluded. I should have realized that I didn't have a robots.txt file at all before the hacker and just deleted it. But I didn't know about the Wayback Machine at the time, so I could not verify this, and I assumed that it WAS part of my website all along and that all they did was add the exclusions for the web spiders. I was afraid that removing robots.txt altogether would make the search engines not find my website. I contacted my mom, who originally made the site and was solely responsible for it until October 8, 2007, and she said she had a robots.txt file. That must be the thing that wasn't properly archived before the 3 year gap (ending early 2008, so it makes sense that I did something to it; but when I asked my mom about it, right after the hackers' attacks, I didn't know about web.archive.org, so I didn't know that anything had happened in early 2008, a few months after I started with it). I should have just removed the file, but I only removed the text I thought was harmful. Unfortunately I thought the 30-second delay was innocuous, and it was not.

Thank you for running this discovery program on my behalf. I am happy to have all my pages added to the Wayback Machine. I see the main page has been archived again as of yesterday, but not, for instance, the shipping cost page http://medexamtools.com/shippingrules.htm. That one was last captured in October, and there is nothing at all for it in 2012. So I can click on the archive of January 6, 2014 and then click on the link within it to the shipping cost page, and it leads not to a capture of January 6 but to one from October somethingorother 2013. I'm surprised it doesn't capture the whole website together each time, but I guess it's one page at a time, whenever it decides to do each one individually. Hopefully my whole site will be archived normally after you have done what you are doing. Thank you once again.

Good grief, I think the robots.txt delay of 30 may not have done much of anything at all, now that I think about it. It was precisely the second half of 2012 when business really went to crap. I was getting just under 2000 dollars a month of business pretty reliably the entire time before that, and now I am getting only 400 a month, and from old, returning customers, not new ones. That makes me think that maybe medexamtools.info was the culprit, sabotaging it, because that was what came right before the decline, not the emergence of the bogus robots.txt file! It could be a coincidence, a suspiciously timed delayed reaction to the robots.txt file, but I'd best do something with medexamtools.info as soon as I can!

Thank you once again for all your attention and investigation.


Poster: Arkiver (Administrator, Curator, or Staff) Date: Jan 7, 2014 1:00pm
Forum: faqs Subject: Re: the captures of my site are so sparse

Just to make you aware of it: there is a small error in your website medexamtools.info that puts two slashes ("//") in the links from one page of the website to another.

Example:
Go to the front page: http://www.medexamtools.info/
Click on a link from the front page and you will be directed to a URL with two slashes: http://www.medexamtools.info//products.php?140 (right before "products.php?140").
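As an illustration of how such a link could be cleaned up, here is a minimal Python sketch that collapses repeated slashes in the path part of a URL. The helper name `collapse_slashes` is hypothetical, and this is not a claim about how the site builds its links. A common cause of this symptom (an assumption here, since the site's code is not shown) is joining a base URL that ends in "/" with a path that starts with "/".

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url: str) -> str:
    """Return the URL with runs of '/' in the path reduced to a single '/'."""
    parts = urlsplit(url)
    # Only the path is touched; the "//" after the scheme and the query stay as they are.
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))

fixed = collapse_slashes("http://www.medexamtools.info//products.php?140")
print(fixed)
```

Fixing the place where the links are generated is the better long-term repair; a helper like this only normalizes URLs after the fact.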


Poster: Arkiver (Administrator, Curator, or Staff) Date: Jan 7, 2014 12:35pm
Forum: faqs Subject: Re: the captures of my site are so sparse

Hmm, I see... They really excluded your site from quite a few crawlers... I see the folder "/cgi-bin/"? What did that folder contain?
Removing a robots.txt file won't exclude your website from search engines. It only tells search engines that there are "no rules" for crawling your website.
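The "no rules" behavior can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch in which an empty rule set stands in for a missing robots.txt file, and the user agent name `ExampleBot` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # empty rule set, standing in for a site with no robots.txt

# With no rules at all, any user agent may fetch any URL.
allowed = parser.can_fetch("ExampleBot", "http://medexamtools.com/shippingrules.htm")
print(allowed)
```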

I haven't started crawling your whole website yet; I am still discovering links. The program has been running for around 7 hours and has discovered around 220,000 links. Tomorrow I will probably be finished, and then you can see your full website in the Wayback Machine. I'll keep you informed about it!

Yes, as I said in my reply to your other post, it can be very bad for the number of customers if you have two websites that split the traffic between them.
I will also start a crawl of the medexamtools.info website since you say you are going to do something with it (not sure if you are going to delete it, or change it dramatically).


I'll keep you informed about my progress and you'll see the websites in the wayback machine... ;)


Poster: Medworks Date: Jan 7, 2014 7:18pm
Forum: faqs Subject: Re: the captures of my site are so sparse

Rather than replying to your 3 replies since I last looked separately, I'll consolidate them into 1 to be concise and not so confusing. I don't know if you (Arkiver) are also Michael Ronayne but I'll do that one separately.

I don't know what was in cgi-bin. I don't seem to have one now. Not anywhere in it. When I try putting medexamtools.com/cgi-bin into any of the wayback machine archives I get a not archived message, so I don't know where you're seeing it. If there was a cgi-bin then it was probably generated by hackers, or it could have possibly been generated by inmotionhosting one of the times when I called them and got customer service and they helped me do something on the site. But I do have a _vti_bin directory if that's what you meant. It's just something that microsoft frontpage generates, which is involved in its operations. If I mess with that too much, the "search our site" function is liable to stop working.

I guess I should probably just get rid of the robots.txt file then. Unless you know of any crawlers that are bad and that I SHOULD exclude.

I see, 2 sites with mediocre results or 1 with twice as many. Basically one site that gets X hits a day or two sites that each get less than X/2 hits a day. In that case, I wonder what I might do with medexamtools.info. The odd thing is though, that the search engines already seem to treat the individual pages comprising medexamtools.com separately. In other words, if I search "dejerine reflex hammer" I won't get medexamtools.com in google, I get www.medexamtools.com/r6-page.htm. And if I search "troemner reflex hammer", I get 3 consecutive results, results 6, 7 and 8 on google, www.medexamtools.com/troemnernew.htm is 6, www.medexamtools.com/r8-page.htm is 7 and www.medexamtools.com/troemnerstreamlined.htm is 8. So it's more complicated than just one site getting webtraffic or 2 splitting it up, my 1 working website already has split results. I have also never seen medexamtools.info in search results. Obviously I understand your reasoning that it's better to get one website than 2 that do the exact same thing though. I wonder if I might put the electronics in medexamtools.info because it's a completely different category, after all old customers know about medexamtools.com. Or maybe just to have each one use different keywords. Though I'm not that good at it, and it would be a massive undertaking. Oh, my medexamtools.info is looking like such a doomed venture, I just wanted to test something to replace the frontpage site.

As for medexamtools.info putting // in a bad place in the links between pages: I have a feeling that's not the biggest problem with poor medexamtools.info, but it's one I'll try to keep in the back of my mind. I really need to try to solve it.

Yeah, being hit by that car ruined a lot. If you want to hear my rant about it, the woman ran a red light to hit me on a crosswalk. Then she had the nerve to tell the cop that I was a crazy jogger who just sprinted out of nowhere and took a swan dive into her windshield while her car was stopped and while she was busy looking to the left and right (actually she wrote left twice on her witness statement and crossed out the second one and wrote right). And that's what the cop said he thought when he interrogated me in my hospital room as I was coming out of a coma. He told me he had 5 witnesses who all agreed on that version of events and was giving me a citation for jaywalking and she got no consequences whatsoever. He was lying. The witnesses agreed with what I said. The police report followed that story, the newspaper and her insurance based off the police report, and I found out weeks later when they made the witness statements available (after the jaywalking ticket was due and after the court date where I might dispute it) that the witnesses all said the same thing I said (or that they didn't witness the actual impact), that she was looking only to the left, only concerned with cars coming from the left, and ignored everything to the right and directly in front of her, and hit the gas and got me. She gave me 100+k$ in medical bills and I had the pleasure of being coerced by a lawyer under threat of paying HIM money I didn't have to accept the 25k$ (all the insurance coverage she had) from allstate which came with the string attached that she was absolved from all further suits. 
She took the time before that to hide her assets, including her divorce settlement, and claimed to only have her fat annuity and fat social security check, both of which are more than I have, even if her lie was true, but she's sitting on all sorts of money, no penalties for perjury about either what happened or her assets (I can't legally do anything to verify if she was telling the truth - and what did she do, spend hundreds of thousands of dollars in 2 years since her divorce and then settle down for being almost as poor as me, I don't believe it), and is still driving around her death machine with 25k insurance coverage, ready to ruin the life of the next person 40 years her junior (play the indy game Turbo Granny, that's pretty much her), no moving citation, doesn't have to take a driving test or anything, and the cop is still probably interrogating semiconscious people in hospital rooms as they come out of comas from having their skulls fractured to drive into them the version of events that involves the least amount of paperwork for him. Also she took my sense of hearing away in my left ear and replaced it with the neverending sound of nails on a chalkboard at 100 decibels, and my IQ dropped from 140 to 110. Hurray for the wheels of justice. It just goes to show you, not EVERYTHING that happens in the US legal system is some trespasser getting bitten by a dog and suing the homeowner for a million dollars, or the old woman who burned herself on mcdonalds coffee and getting a million dollars out of them. But they'd both better hope I'm never in a position of power, I'll tell you that. If I choose to live that long (I can't imagine living like this for years and years and years though, just being away from a loud fan running is absolute agony so I'm a shut-in now), I'm certainly going to show up at her funeral with a westboro baptist church style sign to denigrate her and troll her family though.


Poster: Arkiver (Administrator, Curator, or Staff) Date: Jan 7, 2014 10:42pm
Forum: faqs Subject: Re: the captures of my site are so sparse

Well, you can use robots.txt to tell crawlers not to crawl your website, but it won't work if the crawlers are "bad". The Google crawler, the IA crawler and many other crawlers stick to the rules in robots.txt, but a crawler can also just ignore the rules in the robots.txt file and do whatever it wants. So you can exclude bad crawlers, but it wouldn't help a lot...
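A short Python sketch shows why the exclusion only binds crawlers that choose to check it. The robots.txt content and the bot names `ExampleBadBot` and `SomeOtherBot` are hypothetical. The check runs inside the crawler itself, so a crawler that never asks is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one named bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://medexamtools.com/r6-page.htm"
# A compliant crawler performs this check before each request; nothing forces it to.
bad_bot_allowed = parser.can_fetch("ExampleBadBot", url)
other_bot_allowed = parser.can_fetch("SomeOtherBot", url)
print(bad_bot_allowed, other_bot_allowed)
```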

I don't think I can help you much more with the pages and how they show up in Google's search results. I think you should take a look at some SEO (Search Engine Optimization) articles on how to make your site easier to find in Google and other search engines.

It sounds horrible what happened to you with the car accident... There are some really horrible and disgusting people on this planet... :(


Poster: Medworks Date: Jan 8, 2014 1:39pm
Forum: faqs Subject: Re: the captures of my site are so sparse

Well, yes, obviously it would be completely outrageous for me to expect you to help me redesign my website(s). I'm astonished you did as much as you did.

Just a little FYI though: I just discovered that there IS in fact a reason to have a crawl delay in there. My website was SHUT DOWN and suspended by my webhost 3 hours ago because they got 100,000 requests from a bot in the Netherlands. If, as you say, the "bad bots" don't obey the delay directive in the text file, then it was pure coincidence that this happened right when I removed the delay from robots.txt. More likely than not, though, it was a consequence of removing it, which means it was a "good bot", i.e. one I don't want to ban, that simply crawled so fast that it angered the webhost. The inmotionhosting representative actually said the opposite of what you did: he said they generally recommend a delay of 30 seconds. But I got him to put in a delay of 1 second before lifting the suspension and putting the site live again. So it's apparently bad to remove it entirely. I can only guess that the environment of the internet is different now than it was at the beginning of 2011. As you noted, there was a whole 3 year period with no robots.txt file at all, and in all that time a single IP address never slammed the website with requests for lack of a time delay. Yet it happened essentially as soon as I removed the delay line from robots.txt 2 days ago, here in 2014. So you might consider that it's good to have a delay after all, just not 30 seconds.
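For illustration, a crawler that paces itself might look like the following minimal Python sketch. The function name `crawl_politely` and the stand-in `fetch` callable are hypothetical; this is not the discovery program discussed in this thread. The 1-second delay matches the value mentioned above.

```python
import time
from typing import Callable, Iterable, List

def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Demo with stand-ins so nothing touches the network.
recorded_sleeps: List[float] = []
pages = crawl_politely(
    ["/a.htm", "/b.htm", "/c.htm"],
    fetch=lambda url: "fetched " + url,
    delay_seconds=1.0,
    sleep=recorded_sleeps.append,
)
print(pages, recorded_sleeps)
```

Passing `sleep` as a parameter keeps the demo instant; a real crawler would use the default `time.sleep` and take the delay from the site's robots.txt where one is given.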

Well, thanks for all your help. The IP address in the Netherlands wasn't anything associated with Alexa or the Wayback Machine, was it? You said you were doing somethingorother with my site that would count in the hundreds of thousands. The problem was just that it was too much, too fast.


Poster: Arkiver (Administrator, Curator, or Staff) Date: Jan 8, 2014 10:47pm
Forum: faqs Subject: Re: the captures of my site are so sparse

..... oh...
Gosh, I think I actually am that IP address in the Netherlands... :/
Well, I didn't expect that to happen, I am very very very sorry!! :(

This post was modified by Arkiver on 2014-01-09 06:47:55